Python to R Translation of tokenizer count

November 16, 2023
You said:
Role: a programmer proficient in Python and R. Task: Translate code from Python to R. When something is not possible, ask the user how to proceed. Please translate the code below from Python to R:

python
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows {role/name}\n{content}\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with assistant
    return num_tokens
ChatGPT said:

To translate this Python code to R, we need to consider that R handles functions, exceptions, and data structures differently. Here's the translated code:

r
# Assumes the Python tiktoken library is available via reticulate:
# library(reticulate); tiktoken <- import("tiktoken")
num_tokens_from_messages <- function(messages, model = "gpt-3.5-turbo-0613") {
  # Try to get the encoding for the model; fall back to cl100k_base
  encoding <- tryCatch({
    tiktoken$encoding_for_model(model)
  }, error = function(e) {
    message("Warning: model not found. Using cl100k_base encoding.")
    tiktoken$get_encoding("cl100k_base")
  })
  tokens_per_message <- 0
  tokens_per_name <- 0
  if (model %in% c("gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613",
                   "gpt-4-0314", "gpt-4-32k-0314",
                   "gpt-4-0613", "gpt-4-32k-0613")) {
    tokens_per_message <- 3
    tokens_per_name <- 1
  } else if (model == "gpt-3.5-turbo-0301") {
    tokens_per_message <- 4  # every message follows {role/name}\n{content}\n
    tokens_per_name <- -1    # if there's a name, the role is omitted
  } else if (grepl("gpt-3.5-turbo", model)) {
    message("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-3.5-turbo-0613"))
  } else if (grepl("gpt-4", model)) {
    message("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-4-0613"))
  } else {
    stop(sprintf("num_tokens_from_messages() is not implemented for model %s. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.", model))
  }
  num_tokens <- 0
  for (msg in messages) {  # 'msg' avoids masking base R's message()
    num_tokens <- num_tokens + tokens_per_message
    for (key in names(msg)) {
      value <- msg[[key]]
      num_tokens <- num_tokens + length(encoding$encode(value))
      if (key == "name") {
        num_tokens <- num_tokens + tokens_per_name
      }
    }
  }
  num_tokens <- num_tokens + 3  # every reply is primed with assistant
  return(num_tokens)
}

Some notes on the translation:

  1. R does not have a direct equivalent of Python's try-except. Instead, I used tryCatch which works similarly.
  2. Instead of dict.items() in Python, I used names() to get the keys and then accessed values using [[key]] in R.
  3. R doesn't have a += operator, so accumulation is written as num_tokens <- num_tokens + ....
  4. R handles strings and error messages a bit differently, so I used sprintf for string formatting and message for warnings.

Please let me know if there's anything specific in the code you'd like to alter or if there's functionality you'd like to add.

You said:
There is no tiktoken for R available to my knowledge. Instead, please use the convention that every individual word is approximately 4/3 tokens.
ChatGPT said:

Understood. In the absence of a direct R equivalent for the tiktoken library, we'll use a simplified estimation where each individual word is approximately 4/3 tokens. Here's the revised R code:

r
num_tokens_from_messages <- function(messages, model = "gpt-3.5-turbo-0613") {
  tokens_per_message <- 0
  tokens_per_name <- 0
  if (model %in% c("gpt-3.5-turbo-0613", "gpt-3.5-turbo-16k-0613",
                   "gpt-4-0314", "gpt-4-32k-0314",
                   "gpt-4-0613", "gpt-4-32k-0613")) {
    tokens_per_message <- 3
    tokens_per_name <- 1
  } else if (model == "gpt-3.5-turbo-0301") {
    tokens_per_message <- 4
    tokens_per_name <- -1
  } else if (grepl("gpt-3.5-turbo", model)) {
    message("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-3.5-turbo-0613"))
  } else if (grepl("gpt-4", model)) {
    message("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
    return(num_tokens_from_messages(messages, model = "gpt-4-0613"))
  } else {
    stop(sprintf("num_tokens_from_messages() is not implemented for model %s. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.", model))
  }
  num_tokens <- 0
  for (msg in messages) {  # 'msg' avoids masking base R's message()
    num_tokens <- num_tokens + tokens_per_message
    for (key in names(msg)) {
      value <- msg[[key]]
      # Estimate tokens for each word, assuming each word is 4/3 tokens
      num_tokens <- num_tokens + (length(strsplit(value, " ")[[1]]) * 4 / 3)
      if (key == "name") {
        num_tokens <- num_tokens + tokens_per_name
      }
    }
  }
  num_tokens <- num_tokens + 3  # every reply is primed with assistant
  return(num_tokens)
}

In this version:

  1. The tiktoken related code is removed.
  2. The num_tokens calculation for each value in the message is done by splitting the value into words and then multiplying the word count by 4/3.
  3. Other structure and flow remain the same as in the previous version.
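The heuristic itself is easy to sanity-check. The same 4/3 words-to-tokens rule in Python, for comparison:

```python
def estimate_tokens(text):
    # Approximate tokens as word count * 4/3
    return len(text.split()) * 4 / 3

print(estimate_tokens("one two three"))  # 3 words -> 4.0
```

One caveat: Python's split() with no argument collapses runs of whitespace, whereas strsplit(value, " ") in R produces empty strings for repeated spaces, inflating the word count on irregularly spaced input.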

Please let me know if you need further adjustments or if there's any other aspect of the code you'd like to discuss!

You said:
Can you write me a small RE function that turns all repeats of " " into a single space?
ChatGPT said:

Certainly! In R, you can use regular expressions (regex) to replace multiple spaces with a single space. The gsub function is commonly used for such tasks. Here's a small function that does exactly that:

r
replace_multiple_spaces <- function(text) {
  # Replace one or more spaces with a single space
  return(gsub(" +", " ", text))
}

# Example usage
original_text <- "This   is  a   test    string."
cleaned_text <- replace_multiple_spaces(original_text)
print(cleaned_text)  # "This is a test string."

This function uses the regular expression " +" to match one or more spaces and replaces them with a single space. You can use this function to clean any string that may have irregular spacing.
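For reference, the equivalent one-liner in Python uses re.sub with the same pattern; note that in either language the pattern \s+ would additionally collapse tabs and newlines, if that is desired:

```python
import re

def replace_multiple_spaces(text):
    # Collapse runs of spaces into a single space
    return re.sub(" +", " ", text)

print(replace_multiple_spaces("This   is  a   test."))  # "This is a test."
```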